Two types of errors

I posit there are two types of errors that arise from geocoding. One is where the API totally misinterprets the input and geocodes an address to a latitude and longitude that aren’t even in the city of Austin. I’ll call that “Wrong City” error. The other is when the police record contains an address that would be considered ambiguous even to human eyes, for instance a street name with no street or block number. Then the geocoder is forced to cite a precise location and hence is vulnerable to a possibly large discrepancy. I’ll call this “Ambiguous Address” error.

How accurate are my geocoders?

I’ve mostly stuck to the Google Maps API through the ggmap package, but I also uploaded a spreadsheet of 3,500 unique addresses to geocod.io to test their batch geocoding. The output was nicer than ggmap’s in that it includes a confidence score and a census tract. It cost me $2.12. Let’s peek at the data:

# helper function
showTable <- . %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

geocodio <- vroom(here("data/geocodio_trial_run.csv.gz")) %>% clean_names()

geocodio %>%
  select(1:5, city, census_tract_code) %>%
  head() %>%
  showTable()
| address | latitude | longitude | accuracy_score | accuracy_type | city | census_tract_code |
|---|---|---|---|---|---|---|
| 700 BLOCK EAST 10TH STREET Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 500 BLOCK EAST 6TH STREET Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 6TH STREET / TRINITY STREET Austin, TX | 30.26707 | -97.73933 | 0.80 | intersection | Austin | 001100 |
| 1200 BLOCK NORTH I H 35 Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 1500 BLOCK I H 35 Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 600 BLOCK RED RIVER STREET Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
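One thing jumps out: five of the six rows collapse onto the identical point (30.32637, -97.77126) with a score of 0.33, which smells like a city-level fallback rather than a real match. A quick sketch of how you might count such repeats, using toy rows that mimic the table above (not the full dataset):

```r
library(dplyr)

# toy rows mimicking the six addresses shown above
geocodio_sample <- tibble::tibble(
  latitude       = c(30.32637, 30.32637, 30.26707, 30.32637, 30.32637, 30.32637),
  longitude      = c(-97.77126, -97.77126, -97.73933, -97.77126, -97.77126, -97.77126),
  accuracy_score = c(0.33, 0.33, 0.80, 0.33, 0.33, 0.33)
)

# how many low-confidence rows collapse onto the same point?
fallback_counts <- geocodio_sample %>%
  filter(accuracy_score < 0.5) %>%
  count(latitude, longitude, sort = TRUE)
```

On the toy rows this finds a single repeated point shared by five records, which is the pattern worth checking for in the full file.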

Quality of the Geocod.io Data

Distinct cities

So a first sanity check for “Wrong City” error would be to see if there are any addresses not geocoded to the city of Austin:

geocodio %>% distinct(city)
## # A tibble: 25 x 1
##    city        
##    <chr>       
##  1 Austin      
##  2 Round Rock  
##  3 San Antonio 
##  4 Jarrell     
##  5 Marble Falls
##  6 Pflugerville
##  7 Smithville  
##  8 San Marcos  
##  9 Cedar Park  
## 10 Georgetown  
## # … with 15 more rows

Check Lat/Lon against Austin City Limits

These are all Texas cities, but several of them (San Antonio, Marble Falls, San Marcos) are definitely not even suburbs of Austin. The austin_city_limits.shp file contains geometries for the city council districts in Austin. Let’s use a rough polygon of those districts to see how often geocod.io geocoded to a point outside of Austin:

# using the sf package here
austin_city_limits <-
  st_read(here("data/austin_city_limits.shp")) %>%
  st_union() %>%
  st_convex_hull()
## Reading layer `austin_city_limits' from data source `/Users/nate/workspace/Austin Homelessness/data/austin_city_limits.shp' using driver `ESRI Shapefile'
## Simple feature collection with 199 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -98.01504 ymin: 30.04298 xmax: -97.48034 ymax: 30.51685
## geographic CRS: WGS84(DD)
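To see why st_union() plus st_convex_hull() yields a “rough polygon,” here is a minimal sketch on two disjoint toy squares standing in for council districts (the squares are invented for illustration):

```r
library(sf)

# two disjoint unit squares standing in for council-district polygons
sq <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}
districts <- st_sfc(sq(0), sq(2))

# st_union() merges the pieces into one geometry;
# st_convex_hull() then wraps them in a single convex boundary
hull <- st_convex_hull(st_union(districts))
```

The hull covers the gap between the squares (area 3 versus a union area of 2), so the boundary deliberately over-covers: points in the gap still count as “inside,” which is the conservative behavior we want for a sanity check.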

# create an "sf" object which is just a dataframe with geometries attached
geocodio_geometry <- geocodio %$% map2(longitude, latitude, ~ st_point(c(.x, .y)))
geocodio %<>% st_sf(geometry = geocodio_geometry, crs = st_crs(austin_city_limits))
geocodio %<>% cbind(within_austin = st_intersects(geocodio, austin_city_limits, sparse = F))
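As an aside, sf can build the POINT geometry column directly from coordinate columns with st_as_sf, which avoids the map2/st_point loop. A sketch on a toy data frame (geocodio_toy is hypothetical, standing in for the real table):

```r
library(sf)

geocodio_toy <- data.frame(
  address   = c("A", "B"),
  longitude = c(-97.74, -98.48),
  latitude  = c(30.27, 29.43)
)

# coords are given as (x, y) = (longitude, latitude);
# remove = FALSE keeps the raw coordinate columns alongside the geometry
geocodio_sf <- st_as_sf(geocodio_toy,
                        coords = c("longitude", "latitude"),
                        crs = 4326,
                        remove = FALSE)
```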

# how many points were not in Austin?
geocodio %>%
  count(within_austin) %>%
  as_tibble() %>%
  select(within_austin, n)
## # A tibble: 2 x 2
##   within_austin     n
##   <lgl>         <int>
## 1 FALSE           128
## 2 TRUE           3239

Peek at points outside city limits

That doesn’t look too bad… about 4% of the geocoded addresses are outside of the city of Austin. I’m comfortable dropping them, but let’s take a closer look at the data and see if we can learn anything about the addresses that are getting geocoded to a location outside of the city of Austin.

# let's view a few of the points that are not in Austin
geocodio %>%
  dplyr::filter(!within_austin) %>%
  select(address, accuracy_score, city, county) %>%
  head(10) %>%
  showTable()
| address | accuracy_score | city | county | geometry |
|---|---|---|---|---|
| 110 NORTH IH 35 SERVICE ROAD Austin, TX | 0.90 | Round Rock | Williamson County | POINT (-97.6872 30.5083) |
| 8TH AT COLORADO Austin, TX | 1.00 | San Antonio | Bexar County | POINT (-98.48153 29.43139) |
| N. IH35/WF Austin, TX | 0.68 | Jarrell | Williamson County | POINT (-97.60652 30.8247) |
| N. IH35 S/B/E. CESAR CHAVEZ Austin, TX | 0.68 | San Antonio | Bexar County | POINT (-98.47967 29.4381) |
| E. 11TH STREET/WF/IH35 Austin, TX | 0.76 | Marble Falls | Burnet County | POINT (-98.26873 30.57753) |
| W. 3RD STREET/LAVACA STREET Austin, TX | 0.80 | Smithville | Bastrop County | POINT (-97.16934 30.01062) |
| NORTH IH35 AND EAST 12TH STREET Austin, TX | 0.15 | San Marcos | Hays County | POINT (-97.92301 29.88173) |
| 12TH STREET/IH35 SB Austin, TX | 0.82 | San Antonio | Bexar County | POINT (-98.48095 29.43759) |
| 6th At Neches Street Austin, TX | 1.00 | Georgetown | Williamson County | POINT (-97.68081 30.63813) |
| 12th Street And Ih35 South Bound Austin, TX | 0.82 | San Antonio | Bexar County | POINT (-98.48095 29.43759) |

Plot points for perspective

As a native Austinite I can locate these addresses within the city of Austin. Non-natives may not realize how wrong the given estimates are. Let’s plot the points to get a better sense:

# exclude anything from the plot that is obviously wrong (two points were located in Boston :( )
geocodio_minus_boston <- geocodio %>% dplyr::filter(longitude < -90)

stamen_map <- st_bbox(geocodio_minus_boston) %>%
  as.numeric() %>%
  get_stamenmap()

ggmap(stamen_map) +
  geom_sf(data = geocodio_minus_boston, inherit.aes = F, color = "#F54D97")

Quality of Google Maps data

Let’s do exactly what we did with geocod.io and see how well Google Maps did. First, let’s peek at the shape of the data and then create sf POINT objects like we did before.

google_maps <- vroom(here("data/google_cache.csv.gz")) %>% clean_names()
google_maps %>%
  head(5) %>%
  showTable()
| google_search | lon | lat |
|---|---|---|
| 115 SANDRA MURAIDA WAY Austin, TX | -97.75483 | 30.26827 |
| 12600 BLOCK NORTH IH 35 Austin, TX | -97.71769 | 30.29847 |
| 800 BLOCK EMBASY DRIVE Austin, TX | -97.73227 | 30.26757 |
| 800 BLOCK ELM BERRY Austin, TX | -97.74475 | 30.26461 |
| WEST 5TH STREET and SOUTH MOPAC EXPRESSWAY Austin, TX | -97.76446 | 30.27393 |
# create an "sf" object which is just a dataframe with geometries attached
maps_geometry <- google_maps %$% map2(lon, lat, ~ st_point(c(.x, .y)))
google_maps %<>% st_sf(geometry = maps_geometry, crs = st_crs(austin_city_limits))
google_maps %<>% cbind(within_austin = st_intersects(google_maps, austin_city_limits, sparse = F))

# how many points were not in Austin?
google_maps %>%
  count(within_austin) %>%
  as_tibble() %>%
  select(within_austin, n)
## # A tibble: 2 x 2
##   within_austin     n
##   <lgl>         <int>
## 1 FALSE           116
## 2 TRUE           3576
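At this point the same three steps (build points, attach a CRS, flag within_austin) have been written out twice. A hypothetical helper, flag_within_austin (not from the original code, just a consolidation sketch), could wrap them for any data frame with coordinate columns:

```r
library(sf)

# hypothetical helper consolidating the repeated geocoder-checking steps;
# assumes a data frame with longitude/latitude columns and a boundary geometry
flag_within_austin <- function(df, lon_col, lat_col, boundary) {
  pts <- st_as_sf(df, coords = c(lon_col, lat_col),
                  crs = st_crs(boundary), remove = FALSE)
  # dense logical matrix; the first column flags intersection with the boundary
  pts$within_austin <- st_intersects(pts, boundary, sparse = FALSE)[, 1]
  pts
}
```

With this, the two pipelines above would reduce to something like flag_within_austin(google_maps, "lon", "lat", austin_city_limits) and the geocod.io equivalent.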

Plot points outside city limits

Peeking at the data for “Wrong City” points here would not be very informative since we just have lat and lon and no other identifying information. Let’s skip straight to the plot:

ggmap(stamen_map) +
  geom_sf(data = google_maps, inherit.aes = F, color = "#F54D97")

Conclusions on data quality

Both APIs definitely committed errors. From the maps, though, it doesn’t look like they commit the “same type” of error, i.e. the errors are clustered in different places (in San Antonio for geocod.io, and along I-35 for Google Maps). So perhaps we can conclude that in any one instance, at least one of the answers is “good enough”.

Let’s build a dataset that contains the distance between the two APIs’ estimates for each address. That way, we can look at addresses where geocod.io and Google Maps really disagree.

citations_by_tract <- vroom(here("data/citations_by_tract.csv.gz"))

comparison_dataset <- citations_by_tract %>%
  distinct(google_search) %>%
  inner_join(google_maps, c("google_search")) %>%
  inner_join(geocodio, c("google_search" = "address"))

distance <- ~ st_distance(.x, .y) %>%
  units::set_units(miles) %>%
  as.numeric()
comparison_dataset %<>%
  mutate(distance_miles = map2_dbl(geometry.x, geometry.y, distance)) %>%
  rename(
    within_austin_google   = within_austin.x,
    within_austin_geocodio = within_austin.y
  )
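As a side note, sf can compute element-wise distances directly with by_element = TRUE, which skips the map2_dbl loop. A sketch on two toy points roughly 0.01° of longitude apart (the points are invented for illustration):

```r
library(sf)

p1 <- st_sfc(st_point(c(-97.74, 30.27)), crs = 4326)
p2 <- st_sfc(st_point(c(-97.73, 30.27)), crs = 4326)

# great-circle distance in meters, then converted to miles
d <- st_distance(p1, p2, by_element = TRUE)
d_miles <- as.numeric(units::set_units(d, "mi"))
```

At this latitude, 0.01° of longitude works out to roughly 0.6 miles, which gives a feel for the scale of the disagreements in the table below.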

# let's see if we can construct a table of accuracy by within_austin_XXX
comparison_dataset %>%
  group_by(within_austin_google, within_austin_geocodio) %>%
  summarise(
    count       = n(),
    `Miles 25%` = quantile(distance_miles, 0.25),
    `Miles 50%` = median(distance_miles),
    `Miles 75%` = quantile(distance_miles, 0.75)
  ) %>%
  showTable()
| within_austin_google | within_austin_geocodio | count | Miles 25% | Miles 50% | Miles 75% |
|---|---|---|---|---|---|
| FALSE | FALSE | 7 | 0.2408126 | 0.7046287 | 1.3759725 |
| FALSE | TRUE | 84 | 0.5113389 | 0.6710727 | 1.1977254 |
| TRUE | FALSE | 102 | 0.4425330 | 0.5751898 | 0.6283034 |
| TRUE | TRUE | 2713 | 0.0001669 | 0.0155369 | 0.0675446 |
ggplot(comparison_dataset) +
  geom_histogram(
    aes(distance_miles, fill = glue("{within_austin_geocodio}{within_austin_google}")),
    bins = 75
  ) +
  facet_wrap(~ within_austin_google + within_austin_geocodio,
             ncol = 1, scales = "free", labeller = "label_both") +
  theme_fivethirtyeight() +
  theme(legend.position = "none") +
  labs(
    title    = "Distance in miles between Google and Geocod.io Estimate",
    subtitle = "Cross-tabulated by \"Wrong City\" errors*",
    caption  = "*Note scale differences"
  )

Heuristic Approach

The tables suggest a heuristic approach: first, there is probably very little “Ambiguous Address” bias, as shown by the median distance (about 0.02 miles) when both geocoders place a point within Austin city limits.

I will therefore take being inside Austin city limits as a sufficient condition for accuracy. That means that, using the Google Maps API as the initial source of truth:

  1. Is the Google Maps latitude/longitude inside city limits?
    • if yes, then keep this lat/lon
  2. Is there a geocodio address search we can cross-reference?
    • if no, then discard observation
  3. Is the geocodio latitude/longitude inside city limits?
    • if yes, then keep this lat/lon
    • if no, then discard observation
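The rules above can be sketched with dplyr’s case_when. Rule 2 is already enforced by the earlier inner_join (unmatched searches drop out of comparison_dataset), so only the two within-city flags remain; the toy flags below are hypothetical:

```r
library(dplyr)

# toy rows mimicking comparison_dataset's within-city flags
comparison_toy <- tibble::tibble(
  within_austin_google   = c(TRUE,  FALSE, FALSE),
  within_austin_geocodio = c(FALSE, TRUE,  FALSE)
)

resolved <- comparison_toy %>%
  mutate(keep_source = case_when(
    within_austin_google   ~ "google",    # rule 1: keep Google's lat/lon
    within_austin_geocodio ~ "geocodio",  # rule 3: fall back to geocod.io's
    TRUE                   ~ "discard"    # neither inside city limits: drop
  ))
```

Because case_when evaluates conditions in order, Google’s point wins whenever it is inside the city, matching the priority of the numbered rules.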